{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 进阶RAG检索  MultiQueryRetriever  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 准备工作（加载数据、定义embedding模型、向量库）"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
    "from langchain.document_loaders import PyPDFLoader\n",
    "from langchain.embeddings.openai import OpenAIEmbeddings\n",
    "from langchain.vectorstores import Chroma\n",
    "\n",
    "\n",
    "# Load pdf\n",
    "loader = PyPDFLoader(\"E:\\\\langchain_RAG\\\\data\\\\baichuan.pdf\")\n",
    "data = loader.load()\n",
    "\n",
    "# Split\n",
    "text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)\n",
    "splits = text_splitter.split_documents(data[:6])\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "from getpass import getpass\n",
    "\n",
    "OPENAI_API_KEY = getpass()\n",
    "\n",
    "os.environ[\"OPENAI_API_KEY\"] = OPENAI_API_KEY\n",
    "\n",
    "# VectorDB\n",
    "embedding = OpenAIEmbeddings()\n",
    "vectordb = Chroma.from_documents(documents=splits, embedding=embedding)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### MultiQueryRetriever"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Set logging for the queries\n",
    "import logging\n",
    "\n",
    "logging.basicConfig()\n",
    "logging.getLogger(\"langchain.retrievers.multi_query\").setLevel(logging.INFO)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.retrievers.multi_query import MultiQueryRetriever\n",
    "from langchain.chat_models import ChatOpenAI\n",
    "\n",
    "question = \"what is baichuan2 ?\"\n",
    "llm = ChatOpenAI(temperature=0)\n",
    "retriever_from_llm = MultiQueryRetriever.from_llm(\n",
    "    retriever=vectordb.as_retriever(), llm=llm\n",
    ")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "INFO:langchain.retrievers.multi_query:Generated queries: ['1. Can you provide information about baichuan2?', '2. What can you tell me about baichuan2?', '3. Could you explain the concept of baichuan2 to me?']\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "5"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "docs = retriever_from_llm.get_relevant_documents(query=question)\n",
    "len(docs)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Document(page_content='In this technical report, we introduce Baichuan\\n2, a series of large-scale multilingual language\\nmodels. Baichuan 2 has two separate models,\\nBaichuan 2-7B with 7 billion parameters and\\nBaichuan 2-13B with 13 billion parameters. Both\\nmodels were trained on 2.6 trillion tokens, which\\nto our knowledge is the largest to date, more than\\ndouble that of Baichuan 1 (Baichuan, 2023b,a).\\nWith such a massive amount of training data,\\nBaichuan 2 achieves significant improvements over', metadata={'page': 1, 'source': 'E:\\\\langchain_RAG\\\\data\\\\baichuan.pdf'}),\n",
       " Document(page_content='demonstrates strong performance on medical and\\nlegal domain tasks. On benchmarks such as\\nMedQA (Jin et al., 2021) and JEC-QA (Zhong\\net al., 2020), Baichuan 2 outperforms other open-\\nsource models, making it a suitable foundation\\nmodel for domain-specific optimization.\\nAdditionally, we also released two chat\\nmodels, Baichuan 2-7B-Chat and Baichuan 2-\\n13B-Chat, optimized to follow human instructions.\\nThese models excel at dialogue and context\\nunderstanding. We will elaborate on our', metadata={'page': 1, 'source': 'E:\\\\langchain_RAG\\\\data\\\\baichuan.pdf'}),\n",
       " Document(page_content='Baichuan 1-13B-Base 52.40 51.60 55.30 49.69 43.20 43.01 26.76 11.5913B\\nBaichuan 2-13B-Base 58.10 59.17 61.97 54.33 48.17 48.78 52.77 17.07\\nTable 1: Overall results of Baichuan 2 compared with other similarly sized LLMs on general benchmarks. * denotes\\nresults derived from official websites.\\nFigure 1: The distribution of different categories of\\nBaichuan 2 training data.\\nData processing : For data processing, we focus\\non data frequency and quality. Data frequency', metadata={'page': 2, 'source': 'E:\\\\langchain_RAG\\\\data\\\\baichuan.pdf'}),\n",
       " Document(page_content='Baichuan 1. On general benchmarks like MMLU\\n(Hendrycks et al., 2021a), CMMLU (Li et al.,\\n2023), and C-Eval (Huang et al., 2023), Baichuan\\n2-7B achieves nearly 30% higher performance\\ncompared to Baichuan 1-7B. Specifically, Baichuan\\n2 is optimized to improve performance on math\\nand code problems. On the GSM8K (Cobbe\\net al., 2021) and HumanEval (Chen et al., 2021)\\nevaluations, Baichuan 2 nearly doubles the results\\nof the Baichuan 1. In addition, Baichuan 2 also', metadata={'page': 1, 'source': 'E:\\\\langchain_RAG\\\\data\\\\baichuan.pdf'}),\n",
       " Document(page_content='overall performance of the Baichuan 2 base models\\ncompared to other open or closed-sourced models\\nin Table 1. We then describe our pre-training data\\nand data processing methods. Next, we elaborate\\non the Baichuan 2 architecture and scaling results.\\nFinally, we describe the distributed training system.\\n2.1 Pre-training Data\\nData sourcing : During data acquisition, our\\nobjective is to pursue comprehensive data\\nscalability and representativeness. We gather data', metadata={'page': 1, 'source': 'E:\\\\langchain_RAG\\\\data\\\\baichuan.pdf'})]"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "docs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.prompts import PromptTemplate\n",
    "from langchain.chains import LLMChain\n",
    "\n",
    "\n",
    "\n",
    "template = \"\"\"基于以下提供的内容回答问题，如果内容中不包含问题的答案，请回答“我不知道”\n",
    "内容：\n",
    "{contexts}\n",
    "\n",
    "问题： {query}\n",
    "\"\"\"\n",
    "\n",
    "mulitquery_PROMPT = PromptTemplate(  input_variables=[\"query\", \"contexts\"], template=template,)\n",
    "\n",
    "# Chain\n",
    "qa_chain = LLMChain(llm=llm, prompt=mulitquery_PROMPT)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'query': 'what is baichuan2 ?',\n",
       " 'contexts': 'In this technical report, we introduce Baichuan\\n2, a series of large-scale multilingual language\\nmodels. Baichuan 2 has two separate models,\\nBaichuan 2-7B with 7 billion parameters and\\nBaichuan 2-13B with 13 billion parameters. Both\\nmodels were trained on 2.6 trillion tokens, which\\nto our knowledge is the largest to date, more than\\ndouble that of Baichuan 1 (Baichuan, 2023b,a).\\nWith such a massive amount of training data,\\nBaichuan 2 achieves significant improvements over\\n---\\ndemonstrates strong performance on medical and\\nlegal domain tasks. On benchmarks such as\\nMedQA (Jin et al., 2021) and JEC-QA (Zhong\\net al., 2020), Baichuan 2 outperforms other open-\\nsource models, making it a suitable foundation\\nmodel for domain-specific optimization.\\nAdditionally, we also released two chat\\nmodels, Baichuan 2-7B-Chat and Baichuan 2-\\n13B-Chat, optimized to follow human instructions.\\nThese models excel at dialogue and context\\nunderstanding. We will elaborate on our\\n---\\nBaichuan 1-13B-Base 52.40 51.60 55.30 49.69 43.20 43.01 26.76 11.5913B\\nBaichuan 2-13B-Base 58.10 59.17 61.97 54.33 48.17 48.78 52.77 17.07\\nTable 1: Overall results of Baichuan 2 compared with other similarly sized LLMs on general benchmarks. * denotes\\nresults derived from official websites.\\nFigure 1: The distribution of different categories of\\nBaichuan 2 training data.\\nData processing : For data processing, we focus\\non data frequency and quality. Data frequency\\n---\\nBaichuan 1. On general benchmarks like MMLU\\n(Hendrycks et al., 2021a), CMMLU (Li et al.,\\n2023), and C-Eval (Huang et al., 2023), Baichuan\\n2-7B achieves nearly 30% higher performance\\ncompared to Baichuan 1-7B. Specifically, Baichuan\\n2 is optimized to improve performance on math\\nand code problems. On the GSM8K (Cobbe\\net al., 2021) and HumanEval (Chen et al., 2021)\\nevaluations, Baichuan 2 nearly doubles the results\\nof the Baichuan 1. In addition, Baichuan 2 also\\n---\\noverall performance of the Baichuan 2 base models\\ncompared to other open or closed-sourced models\\nin Table 1. We then describe our pre-training data\\nand data processing methods. Next, we elaborate\\non the Baichuan 2 architecture and scaling results.\\nFinally, we describe the distributed training system.\\n2.1 Pre-training Data\\nData sourcing : During data acquisition, our\\nobjective is to pursue comprehensive data\\nscalability and representativeness. We gather data',\n",
       " 'text': 'Baichuan 2 is a series of large-scale multilingual language models. It includes two separate models, Baichuan 2-7B with 7 billion parameters and Baichuan 2-13B with 13 billion parameters. These models were trained on a massive amount of data, 2.6 trillion tokens, which is the largest to date. Baichuan 2 demonstrates strong performance on medical and legal domain tasks and outperforms other open-source models on benchmarks such as MedQA and JEC-QA. Additionally, there are two chat models, Baichuan 2-7B-Chat and Baichuan 2-13B-Chat, optimized for dialogue and context understanding.'}"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "out = qa_chain(  inputs={\"query\": question,  \n",
    "                         \"contexts\": \"\\n---\\n\".join([d.page_content for d in docs]) }\n",
    "                        )\n",
    "\n",
    "out"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "llangchainhf",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
